library(tidyverse)
## Warning: package 'tibble' was built under R version 3.6.2
library(visreg)
## Warning: package 'visreg' was built under R version 3.6.2
library(mgcv)
library(ISLR)
library(splines)
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

1. Select the response variable ‘wage’ and the predictors ‘age’, ‘year’, ‘education’, and ‘jobclass’ from the ‘Wage’ data set. Examine which predictors show a non-linear relation with the response variable.

wdf <- Wage
# Fitting a linear regression model
m <- lm(wage ~ age + year + education + jobclass, data = wdf)
ggplot(wdf, aes(age, wage)) +
    geom_point() + ggtitle("Scatter plot for Age and Wage") +
    geom_smooth(method = "lm", formula = y ~ x) + geom_smooth(color = "red")

ggplot(wdf, aes(year, wage)) +
    geom_point() + ggtitle("Scatter plot for Year and Wage") +
    geom_smooth(method = "lm", formula = y ~ x) + geom_smooth(color = "red")
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.

visreg(m, "education",gg=TRUE)

visreg(m, "jobclass", gg=TRUE)

plot(m, 1)  # residuals vs fitted plot

Age shows a non-linear relationship with wage, whereas year shows a linear pattern. Education and jobclass are categorical, so their plots show level shifts across categories rather than trends. Moreover, after fitting a model with all predictors, the residuals-vs-fitted plot shows evidence of slightly non-linear behaviour, consistent with a non-linear age effect.
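The visual impression can be backed up with a quick numeric check: compare the linear fit to one that adds a quadratic term for age and test the improvement with an F-test. This is a sketch, not part of the assignment output above; it assumes the ISLR package (with its Wage data) is installed.

```r
library(ISLR)   # provides the Wage data

# Linear fit vs. a fit with an added quadratic age term
m_lin  <- lm(wage ~ age + year + education + jobclass, data = Wage)
m_quad <- lm(wage ~ poly(age, 2) + year + education + jobclass, data = Wage)

# A small p-value here supports a non-linear effect of age
anova(m_lin, m_quad)
```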

2. Fit a linear model with wage as the response and all selected predictors included. Fit this model with a linear predictor function using the mgcv::gam function. What is the average difference in wage between a worker of 20 years old and a worker of 60 years old, holding all other things constant? Plot the partial residuals for age. What can you say about the relation between wage and age?

m1 <- gam(wage ~ age + year + education + jobclass, data = wdf)
summary(m1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## wage ~ age + year + education + jobclass
## 
## Parametric coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.089e+03  6.483e+02  -3.222 0.001286 ** 
## age                          5.485e-01  5.719e-02   9.591  < 2e-16 ***
## year                         1.071e+00  3.233e-01   3.314 0.000931 ***
## education2. HS Grad          1.116e+01  2.473e+00   4.511 6.71e-06 ***
## education3. Some College     2.339e+01  2.614e+00   8.947  < 2e-16 ***
## education4. College Grad     3.834e+01  2.616e+00  14.655  < 2e-16 ***
## education5. Advanced Degree  6.275e+01  2.871e+00  21.859  < 2e-16 ***
## jobclass2. Information       4.575e+00  1.379e+00   3.318 0.000917 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.263   Deviance explained = 26.5%
## GCV = 1286.9  Scale est. = 1283.5    n = 3000
# Partial residual plot
visreg(m1, "age")

The partial residual plot suggests that the fitted relationship between age and wage is linear, as specified by the model.

For the difference between age 20 and age 60, multiply the age coefficient by the age gap: (60 − 20) × 0.5485 ≈ 21.94.
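Written out as a quick sketch (the age coefficient 5.485e-01 is taken from summary(m1) above, rounded to 0.5485):

```r
# Expected wage difference between a 60- and a 20-year-old worker,
# holding the other predictors constant: (60 - 20) * beta_age
b_age <- 0.5485                 # age coefficient from summary(m1)
diff_wage <- (60 - 20) * b_age
round(diff_wage, 2)             # 21.94
```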

3. Produce suitable plots to check the residual diagnostics of the model in (2). Do you spot a problem? Revise the model and assess whether the fit improved.

par(mfrow = c(2,2))
gam.check(m1)

## 
## Method: GCV   Optimizer: magic
## Model required no smoothing parameter selection
## Model rank =  8 / 8

The QQ-plot and histogram of the residuals show that they are not normally distributed, which violates an assumption of the linear regression model. To improve the model, we can take the log of the dependent variable and refit.

# Using a log transformation of the response to improve the model
m_revised <- gam(log(wage) ~ age + year + education + jobclass, data = wdf)

Improved model

# Improved m1 model
m2 <- m_revised
par(mfrow = c(2,2))
gam.check(m2)

## 
## Method: GCV   Optimizer: magic
## Model required no smoothing parameter selection
## Model rank =  8 / 8

The QQ-plot and histogram show that taking the log of wage improved the fit considerably: the residuals are now approximately normally distributed, so the assumptions of the linear regression model are satisfied.
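A practical consequence of the log transformation is that coefficients now have a multiplicative interpretation on the wage scale: a coefficient b on log(wage) multiplies the expected wage by exp(b). A minimal sketch with a hypothetical coefficient value (not taken from the output above):

```r
# Generic identity: a coefficient b on log(wage) corresponds to an
# exp(b) multiplicative effect on wage, i.e. roughly 100*b percent
# for small b.
b <- 0.10                          # hypothetical coefficient value
pct_change <- (exp(b) - 1) * 100
round(pct_change, 1)               # 10.5 (% higher wage)
```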

4. Fit your revised model with a regression spline with 20 knots for age, a regression spline with 4 knots for year, and linear terms for the remaining selected predictors. Visualise the regression splines and comment on the results.

# Natural splines: df = 20 for age, df = 4 for year; linear terms otherwise
m3 <- lm(log(wage) ~ ns(age, 20) + ns(year, 4) + education + jobclass, data = wdf)
summary(m3)
## 
## Call:
## lm(formula = log(wage) ~ ns(age, 20) + ns(year, 4) + education + 
##     jobclass, data = wdf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72890 -0.15628  0.01029  0.16827  1.10332 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.91062    0.06151  63.582  < 2e-16 ***
## ns(age, 20)1                 0.37329    0.06460   5.779 8.31e-09 ***
## ns(age, 20)2                 0.39297    0.08322   4.722 2.44e-06 ***
## ns(age, 20)3                 0.34655    0.07505   4.618 4.04e-06 ***
## ns(age, 20)4                 0.45854    0.07788   5.888 4.36e-09 ***
## ns(age, 20)5                 0.43635    0.08356   5.222 1.89e-07 ***
## ns(age, 20)6                 0.52892    0.08252   6.409 1.69e-10 ***
## ns(age, 20)7                 0.51121    0.07715   6.626 4.08e-11 ***
## ns(age, 20)8                 0.48661    0.07860   6.191 6.81e-10 ***
## ns(age, 20)9                 0.54916    0.07768   7.069 1.93e-12 ***
## ns(age, 20)10                0.44376    0.08392   5.288 1.33e-07 ***
## ns(age, 20)11                0.59027    0.07804   7.564 5.19e-14 ***
## ns(age, 20)12                0.44807    0.07593   5.901 4.02e-09 ***
## ns(age, 20)13                0.54939    0.08024   6.847 9.12e-12 ***
## ns(age, 20)14                0.42331    0.07832   5.405 7.00e-08 ***
## ns(age, 20)15                0.55262    0.07745   7.135 1.21e-12 ***
## ns(age, 20)16                0.45405    0.07623   5.957 2.88e-09 ***
## ns(age, 20)17                0.51748    0.07017   7.374 2.13e-13 ***
## ns(age, 20)18                0.38302    0.07573   5.058 4.49e-07 ***
## ns(age, 20)19                0.48202    0.14748   3.268 0.001094 ** 
## ns(age, 20)20                0.19592    0.10297   1.903 0.057171 .  
## ns(year, 4)1                 0.09556    0.02890   3.306 0.000957 ***
## ns(year, 4)2                 0.04256    0.02464   1.728 0.084177 .  
## ns(year, 4)3                 0.09242    0.03513   2.630 0.008571 ** 
## ns(year, 4)4                 0.06217    0.02001   3.108 0.001904 ** 
## education2. HS Grad          0.11076    0.02029   5.459 5.18e-08 ***
## education3. Some College     0.22322    0.02150  10.384  < 2e-16 ***
## education4. College Grad     0.33597    0.02152  15.613  < 2e-16 ***
## education5. Advanced Degree  0.49568    0.02359  21.016  < 2e-16 ***
## jobclass2. Information       0.03482    0.01129   3.083 0.002068 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2924 on 2970 degrees of freedom
## Multiple R-squared:  0.3157, Adjusted R-squared:  0.3091 
## F-statistic: 47.26 on 29 and 2970 DF,  p-value: < 2.2e-16
# Plot visualisation

visreg(m3, "age", gg = TRUE) + ggtitle("Age")

visreg(m3, "year", gg = TRUE) + ggtitle("Year")

The age plot shows that log wage increases until about 50 years of age and then declines slightly. The year plot shows that wages increase roughly linearly with year.
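For reference, splines::ns() places its interior knots at quantiles of the predictor, and the locations can be inspected via attr(). A minimal sketch on a synthetic age range (not the Wage data):

```r
library(splines)

x <- 18:80                      # synthetic age range
basis <- ns(x, df = 4)          # natural spline with 4 degrees of freedom

attr(basis, "knots")            # 3 interior knots, at quantiles of x
attr(basis, "Boundary.knots")   # boundary knots at range(x), here 18 and 80
```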

5. Fit your revised model with a penalized regression spline for age, and linear terms for the remaining selected predictors. Plot the smooth term with a confidence interval. What can you say about the relation between age and wage?

# Penalized regression spline for age, smoothing parameter selected by REML
m4 <- gam(log(wage) ~ s(age) + year + education + jobclass, method = "REML", data = wdf)
summary(m4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(wage) ~ s(age) + year + education + jobclass
## 
## Parametric coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -19.990414   5.305136  -3.768 0.000168 ***
## year                          0.012161   0.002645   4.598 4.44e-06 ***
## education2. HS Grad           0.113547   0.020204   5.620 2.09e-08 ***
## education3. Some College      0.227946   0.021375  10.664  < 2e-16 ***
## education4. College Grad      0.338383   0.021429  15.791  < 2e-16 ***
## education5. Advanced Degree   0.498599   0.023516  21.202  < 2e-16 ***
## jobclass2. Information        0.035032   0.011258   3.112 0.001877 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value    
## s(age) 5.622  6.732 49.03  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.309   Deviance explained = 31.2%
## -REML = 599.61  Scale est. = 0.085461  n = 3000
anova(m4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(wage) ~ s(age) + year + education + jobclass
## 
## Parametric Terms:
##           df       F  p-value
## year       1  21.142 4.44e-06
## education  4 172.486  < 2e-16
## jobclass   1   9.683  0.00188
## 
## Approximate significance of smooth terms:
##          edf Ref.df     F p-value
## s(age) 5.622  6.732 49.03  <2e-16
par(mfrow = c(2,2))
gam.check(m4)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-0.0003521119,0.0002342288]
## (score 599.6133 & scale 0.08546121).
## Hessian positive definite, eigenvalue range [1.884817,1496.004].
## Model rank =  16 / 16 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##          k'  edf k-index p-value
## s(age) 9.00 5.62       1    0.46
visreg(m4, "age", gg = TRUE) + ggtitle("Age")

Here the model is fitted with a penalized regression spline. The diagnostics indicate that the model assumptions are satisfied, and the age smooth shows almost the same behaviour as in the model with fixed knots.
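As an alternative to visreg, mgcv's own plot method draws the smooth with a shaded confidence band. A sketch that refits the model so the block is self-contained (assumes the ISLR package is installed):

```r
library(mgcv)
library(ISLR)

m4 <- gam(log(wage) ~ s(age) + year + education + jobclass,
          method = "REML", data = Wage)

# shade = TRUE draws the approximate +/- 2 SE band around the smooth
plot(m4, shade = TRUE, main = "s(age) with approximate 95% CI")
```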

6. The relation between wage and age can be different across different job types. Include an interaction between the smooth term for age and jobclass in your model. Plot the interaction and provide an interpretation.

m5 <- gam(log(wage) ~ s(age, by = jobclass) + year + education + jobclass, data = wdf)
summary(m5)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## log(wage) ~ s(age, by = jobclass) + year + education + jobclass
## 
## Parametric coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -19.957255   5.304042  -3.763 0.000171 ***
## year                          0.012145   0.002644   4.593 4.56e-06 ***
## education2. HS Grad           0.112779   0.020219   5.578 2.65e-08 ***
## education3. Some College      0.227836   0.021387  10.653  < 2e-16 ***
## education4. College Grad      0.338561   0.021452  15.783  < 2e-16 ***
## education5. Advanced Degree   0.499573   0.023529  21.232  < 2e-16 ***
## jobclass2. Information        0.035283   0.011262   3.133 0.001747 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf Ref.df     F p-value    
## s(age):jobclass1. Industrial  4.635  5.673 37.62  <2e-16 ***
## s(age):jobclass2. Information 4.340  5.355 21.62  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.309   Deviance explained = 31.3%
## GCV = 0.085919  Scale est. = 0.085461  n = 3000
# Plot diagnostics
par(mfrow=c(2,2))
gam.check(m5)

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 5 iterations.
## The RMS GCV score gradient at convergence was 4.751158e-07 .
## The Hessian was positive definite.
## Model rank =  25 / 25 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                                 k'  edf k-index p-value
## s(age):jobclass1. Industrial  9.00 4.63       1    0.42
## s(age):jobclass2. Information 9.00 4.34       1    0.51
# Interaction plot
visreg(m5, "age", by = "jobclass", gg = TRUE, overlay = TRUE) + ggtitle("Interaction of age and jobclass")

The model diagnostics do not suggest any problem. The interaction plot shows that for industrial jobs wages increase from age 20 and start to decline after about age 50, whereas for information jobs wages increase until about age 40 and then decline slightly.

7. Test whether the interaction term is significant. Revise your model accordingly and test whether your revised model contains non-linear terms.

To test this, we compare the models with and without the interaction using anova(), which provides an approximate F-test.

anova(m5,m4,test = "F")
## Analysis of Deviance Table
## 
## Model 1: log(wage) ~ s(age, by = jobclass) + year + education + jobclass
## Model 2: log(wage) ~ s(age) + year + education + jobclass
##   Resid. Df Resid. Dev      Df Deviance      F Pr(>F)
## 1    2982.0     255.02                               
## 2    2985.9     255.31 -3.9532  -0.2869 0.8492 0.4928

The p-value is 0.4928, greater than 0.05, so we do not reject the null hypothesis: the interaction term is not needed, and the model without the interaction is preferred.

revised <- gam(log(wage) ~ s(age) + year + education, data = wdf)

# Test for non-linearity: linear age term vs. penalized smooth
anova(m2, revised, test = "F")
## Analysis of Deviance Table
## 
## Model 1: log(wage) ~ age + year + education + jobclass
## Model 2: log(wage) ~ s(age) + year + education
##   Resid. Df Resid. Dev     Df Deviance      F    Pr(>F)    
## 1    2992.0     272.46                                     
## 2    2987.6     256.18 4.4034   16.282 43.139 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is very small, well below 0.05, so we reject the null hypothesis: the effect of age is non-linear, and the model with the smooth term is the better fit.
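The F-test conclusion can be cross-checked informally with AIC. A self-contained sketch that refits both models (assumes the ISLR package is installed); the smooth-age model should come out lower:

```r
library(mgcv)
library(ISLR)

# Linear age term vs. penalized smooth for age
m_linear <- gam(log(wage) ~ age + year + education + jobclass, data = Wage)
m_smooth <- gam(log(wage) ~ s(age) + year + education, data = Wage)

AIC(m_linear, m_smooth)   # lower AIC favours the smooth-age model
```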

8. Use your model to predict the wage of an industrial job in 2009 for a worker with HS grad education, of age 20 and of age 60.

d1 <- data.frame(year = 2009, education = "2. HS Grad", age = 20, jobclass = "1. Industrial")

# Predicting for age 20
predict.gam(revised, d1, se = TRUE)
## $fit
##        1 
## 4.205112 
## 
## $se.fit
##          1 
## 0.02797235
d2 <- data.frame(year = 2009, education = "2. HS Grad", age = 60, jobclass = "1. Industrial")

# Predicting for age 60
predict.gam(revised, d2, se = TRUE)
## $fit
##        1 
## 4.605911 
## 
## $se.fit
##          1 
## 0.01895458

From the predictions, the expected log(wage) is 4.205 at age 20 and 4.606 at age 60. Exponentiating returns these to the wage scale: exp(4.205) ≈ 67 for age 20 and exp(4.606) ≈ 100 for age 60.
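As a worked check of the back-transformation (fitted values copied from the predict.gam output above):

```r
# Predicted log(wage) values from the predict.gam output above
fit_20 <- 4.205112
fit_60 <- 4.605911

# exp() returns to the original wage scale (note: exponentiating the
# predicted mean of log(wage) slightly understates the mean wage)
round(exp(fit_20), 1)   # 67.0
round(exp(fit_60), 1)   # 100.1
```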